52 research outputs found

    PIOMan : un gestionnaire d'entrées-sorties générique

    Communication mechanisms are, most of the time, implemented efficiently and deliver good performance when data transfers take place in an undisturbed environment, i.e. when the resources (processor, memory bus, network cards) are available. But with the development of multi-core architectures and the multiplication of computing units, this performance is hard to guarantee: the use of threads inside applications degrades reaction time and therefore performance. In this article, we present an I/O manager that makes a communication library and a thread scheduler cooperate in order to remain highly reactive whatever the execution context. This manager, both generic and portable, lets communications take advantage of multithreading, notably by ensuring that communications progress in the background, transparently to the application.

    Bibliothèque de communication multi-threadée pour architectures multi-coeurs

    National audience. Cluster architecture has evolved considerably over the past few years. While until recently most nodes had only a few compute cores, machines with dozens of cores are becoming commonplace. This hardware evolution has been accompanied by a change in programming models: pure MPI approaches are giving way to models mixing message passing and multithreading. When designing modern communication libraries, concurrent accesses and the scalability issues raised by multi-core processors must therefore be taken into account. This article presents several approaches to designing a communication library suited to current architectures. We study the performance impact of these methods, and detail several techniques for exploiting idle cores. The evaluations show that such mechanisms make it possible to spread the load due to network processing and to overlap communication with computation.

    Gestion de la réactivité des communications réseau

    Communication mechanisms are usually implemented efficiently and therefore achieve good performance as long as the data transfer takes place in an "undisturbed environment", i.e. the resources (CPU, memory bus, network interface) are fully available. On the other hand, communication reactivity during computing phases, i.e. the ability to react quickly to the network, is hard to ensure. However, by centralizing network accesses, it is possible to maintain good reactivity even during computing phases. In that context, we describe a new server, interacting with both a communication library and a thread scheduler, that provides good reactivity whatever the execution context.

    A scalable and generic task scheduling system for communication libraries

    International audience. Since the advent of multi-core processors, the shape of typical clusters has dramatically evolved. This new massively multi-core era is a major change in architecture, causing programming models to evolve towards hybrid MPI+threads and therefore requiring new features at the low level. Modern communication subsystems now have to deal with multi-threading: the impact of thread safety, the contention on network interfaces, and the consequences of data locality on performance have to be studied carefully. In this paper, we present PIOMan, a scalable and generic lightweight task scheduling system for communication libraries. It is designed to ensure concurrent progression of multiple tasks of a communication library (polling, offload, multi-rail) through the use of multiple cores, while preserving locality to avoid contention and to scale to a large number of cores and threads. We have implemented the model, evaluated its performance, and compared it to state-of-the-art solutions regarding overhead, scalability, and communication/computation overlap.
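    The locality idea described above can be sketched in a few lines of C: each core keeps its own tasklet queue, so polling tasks run on the core that submitted them and cores do not contend on a single shared list. All names here (`submit_on`, `run_local`, the queue layout) are illustrative assumptions, not PIOMan's actual API.

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Per-core tasklet queues: a hypothetical sketch of locality-aware
     * task scheduling.  Each core owns one ring buffer of tasklets. */
    #define NCORES 4
    #define QLEN   16

    typedef void (*tasklet_fn)(void *);

    struct tasklet { tasklet_fn fn; void *arg; };

    static struct tasklet queue[NCORES][QLEN];
    static int qhead[NCORES], qtail[NCORES];

    /* Enqueue a tasklet on a specific core's queue; returns -1 if full. */
    static int submit_on(int core, tasklet_fn fn, void *arg)
    {
        if ((qtail[core] + 1) % QLEN == qhead[core]) return -1;
        queue[core][qtail[core]] = (struct tasklet){ fn, arg };
        qtail[core] = (qtail[core] + 1) % QLEN;
        return 0;
    }

    /* Drain the local queue; in a real scheduler this would be called
     * from the idle loop of the owning core. */
    static void run_local(int core)
    {
        while (qhead[core] != qtail[core]) {
            struct tasklet t = queue[core][qhead[core]];
            qhead[core] = (qhead[core] + 1) % QLEN;
            t.fn(t.arg);
        }
    }

    static void poll_nic(void *arg) { (*(int *)arg)++; }

    int main(void)
    {
        int polled = 0;
        submit_on(2, poll_nic, &polled);
        submit_on(2, poll_nic, &polled);
        run_local(2);           /* both tasklets run on core 2's queue */
        assert(polled == 2);
        return 0;
    }
    ```

    Keeping one queue per core avoids a global lock; a real implementation would additionally need atomic operations for cross-core submission.
    
    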

    An analysis of the impact of multi-threading on communication performance

    International audience. Although processors are becoming massively multicore, and new programming models therefore mix message passing and multi-threading, the effects of threads on communication libraries remain neglected. Designing an efficient modern communication library requires precautions to limit the impact of thread-safety mechanisms on performance. In this paper, we present various approaches to building a thread-safe communication library and study their benefits and impact on performance. We also describe and evaluate techniques that exploit idle cores to balance the communication library's load across multicore machines.

    A multicore-enabled multirail communication engine

    International audience. The current trend in cluster architecture leads toward a massive use of multicore chips. This hardware evolution raises bottleneck issues at the network interface level. Using multiple parallel networks overcomes this problem by providing a higher aggregate bandwidth, but this bandwidth remains theoretical, as only a few communication libraries are able to exploit multiple networks. In this paper, we present an optimization strategy for the NewMadeleine communication library that efficiently exploits parallel interconnect links. By sampling each network's capabilities, a transfer duration can be estimated a priori; messages can thus be split and their chunks sent over parallel links efficiently enough to reach the theoretical aggregate bandwidth. NewMadeleine is multithreaded and exploits multicore chips to send small packets, which involve CPU-consuming copies, in parallel.
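    The a-priori split of a message across two rails can be sketched as follows: assuming a simple linear cost model (latency + size/bandwidth, with the parameters obtained by sampling), choose the chunk sizes so that both transfers finish at the same time. The function name and the cost model are assumptions for illustration, not NewMadeleine's actual strategy code.

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Hypothetical sketch: split `len` bytes across two rails so that both
     * transfers finish simultaneously, given sampled bandwidth (bytes/s)
     * and latency (s) per rail.  Solves
     *   lat1 + s/bw1 = lat2 + (len - s)/bw2
     * for the chunk size s sent on rail 1. */
    static size_t split_for_rails(size_t len, double bw1, double lat1,
                                  double bw2, double lat2)
    {
        double s = bw1 * ((double)len / bw2 + lat2 - lat1) / (1.0 + bw1 / bw2);
        if (s < 0.0) s = 0.0;
        if (s > (double)len) s = (double)len;
        return (size_t)s;
    }

    int main(void)
    {
        /* Two identical rails: each gets (about) half the message. */
        size_t half = split_for_rails(1 << 20, 1e9, 1e-6, 1e9, 1e-6);
        assert(half >= (1 << 20) / 2 - 1 && half <= (1 << 20) / 2 + 1);

        /* A rail with twice the bandwidth gets the larger chunk. */
        assert(split_for_rails(1 << 20, 2e9, 1e-6, 1e9, 1e-6) > (1 << 20) / 2);
        return 0;
    }
    ```

    With more than two rails the same balance condition generalizes to equalizing all finish times, which is why per-NIC sampling matters.
    
    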

    Runtime function instrumentation with EZTrace

    International audience. High-performance computing relies more and more on complex hardware: multiple computers, multi-processor computers, multi-core processing units, multiple general-purpose graphics processing units... To efficiently exploit the power of current computing architectures, modern applications rely on a high level of parallelism. To analyze and optimize these applications, tracking the software's behavior with minimal impact on it is necessary to extract the time consumption of code sections as well as resource usage (e.g., network messages). In this paper, we present a method for instrumenting functions in a binary application. This method collects data at the entry and exit of a function, allowing the execution of an application to be analyzed. We implemented this mechanism in EZTrace, and the evaluation shows a significant improvement over other instrumentation tools.
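    The entry/exit collection principle can be emulated in plain C. A real tool like EZTrace patches the function in the binary at runtime; this macro-based sketch (all names are illustrative) only shows what the entry/exit event pair looks like and how it brackets the function body.

    ```c
    #include <assert.h>

    /* Hypothetical sketch of entry/exit instrumentation: each traced
     * function records one event on entry and one on exit.  A real
     * tracer would also record the function name and a timestamp. */
    static int n_events = 0;

    #define TRACE_ENTER(name) do { (void)(name); n_events++; } while (0)
    #define TRACE_EXIT(name)  do { (void)(name); n_events++; } while (0)

    static int traced_add(int a, int b)
    {
        TRACE_ENTER("traced_add");
        int r = a + b;
        TRACE_EXIT("traced_add");
        return r;
    }

    int main(void)
    {
        int r = traced_add(2, 3);
        assert(r == 5);          /* the function still computes normally */
        assert(n_events == 2);   /* one entry event, one exit event */
        return 0;
    }
    ```

    Compilers offer a similar hook without source changes (e.g. GCC's `-finstrument-functions`, which calls `__cyg_profile_func_enter`/`__cyg_profile_func_exit`); binary instrumentation goes one step further by requiring no recompilation at all.
    
    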

    A multithreaded communication engine for multicore architectures

    International audience. The current trend in clusters leads towards an increase in the number of cores per node. As a result, an increasing number of parallel applications mix message passing and multithreading in an attempt to better match the underlying architecture. This naturally raises the problem of designing efficient, multithreaded implementations of MPI. In this paper, we present the design of a multithreaded communication engine able to exploit idle cores to speed up communications in two ways: it can move CPU-intensive operations out of the critical path (e.g. offloading PIO transfers), and it can let rendezvous transfers progress asynchronously. We have implemented these methods in the PM2 software suite, evaluated their behavior in typical cases, and observed good performance in overlapping communication and computation.
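    The asynchronous-progression pattern mentioned above can be sketched with a dedicated POSIX thread that completes a pending "transfer" while the main thread computes. This is a deliberately simplified illustration of the idea, not the PM2 implementation.

    ```c
    #include <assert.h>
    #include <pthread.h>
    #include <stdatomic.h>

    /* Hypothetical sketch: a progress thread, placed on an idle core,
     * makes a pending communication advance in the background. */
    static atomic_int transfer_done = 0;

    static void *progress_loop(void *arg)
    {
        (void)arg;
        /* A real engine would poll the NIC and drive the rendezvous
         * protocol here; we simply mark the transfer complete. */
        atomic_store(&transfer_done, 1);
        return NULL;
    }

    int main(void)
    {
        pthread_t progress;
        pthread_create(&progress, NULL, progress_loop, NULL);

        /* Computation phase: runs concurrently with the progression. */
        long long sum = 0;
        for (long long i = 0; i < 1000000; i++) sum += i;

        pthread_join(progress, NULL);
        assert(atomic_load(&transfer_done) == 1);
        assert(sum == 999999LL * 1000000LL / 2);
        return 0;
    }
    ```

    The benefit appears when the computation loop is long enough to hide the transfer: the communication cost overlaps with useful work instead of adding to it.
    
    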

    Improving Reactivity and Communication Overlap in MPI using a Generic I/O Manager

    International audience. MPI applications may waste thousands of CPU cycles if they do not efficiently overlap communication and computation. In this paper, we present a generic and portable I/O manager that makes communication progress asynchronously using tasklets. It automatically chooses the most appropriate communication method depending on the context: whether or not the application is multi-threaded, whether or not the machine is an SMP. We have implemented and evaluated our I/O manager with Mad-MPI, our own MPI implementation, and compared it to other existing MPI implementations with respect to their ability to efficiently overlap communication and computation.

    A sampling-based approach for communication libraries auto-tuning

    International audience. Communication performance is a critical issue in HPC applications, and many solutions have been proposed in the literature (algorithms, protocols, etc.). In the meantime, computing nodes have become massively multicore, leading to a real imbalance between the number of communication sources and the number of physical communication resources. It is thus now mandatory to share network boards between computation flows, and to take this sharing into account when performing communication optimizations. In previous papers, we proposed a model and a framework for on-the-fly optimization of multiplexed concurrent communication flows, and implemented this model in the NewMadeleine communication library. This library features optimization strategies able, for example, to aggregate several messages to reduce the number of packets emitted on the network, or to split messages so as to use several NICs at the same time. In this paper, we study the tuning of these dynamic optimization strategies. We show that some parameters and thresholds (rendezvous threshold, aggregation packet size) depend on the actual hardware, both the host and the NICs. We propose and implement a sampling-based method to auto-tune our strategies on the actual hardware. Moreover, we show that multi-rail can greatly benefit from performance prediction, and we propose a multi-rail approach that dynamically balances the data between NICs using sampling-based predictions.
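    One of the thresholds mentioned above can be derived from sampled parameters. Under a simple cost model (an assumption for illustration, not the paper's actual model), an eager send pays an extra memory copy of s / copy_bw seconds, while a rendezvous pays an extra round-trip of about 2 * lat seconds; the threshold is the message size where the two costs match.

    ```c
    #include <assert.h>

    /* Hypothetical sketch of rendezvous-threshold auto-tuning.
     * Eager extra cost:       s / copy_bw   (one more memory copy)
     * Rendezvous extra cost:  2 * lat       (handshake round-trip)
     * Crossover:  s / copy_bw = 2 * lat  =>  s* = 2 * lat * copy_bw */
    static double rdv_threshold(double lat_s, double copy_bw_bytes_per_s)
    {
        return 2.0 * lat_s * copy_bw_bytes_per_s;
    }

    int main(void)
    {
        /* Sampled values: 5 us network latency, 4 GB/s memcpy bandwidth
         * give a threshold of about 40 KB. */
        double t = rdv_threshold(5e-6, 4e9);
        assert(t > 39999.0 && t < 40001.0);
        return 0;
    }
    ```

    Since both the latency and the copy bandwidth differ across hosts and NICs, sampling them on the target machine is what makes the threshold portable.
    
    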